
    Low-Cost Deep Convolutional Neural Network Acceleration with Stochastic Computing and Quantization

    Department of Computer Science and Engineering
    For about a decade, image classification performance led by deep convolutional neural networks (DCNNs) has advanced dramatically. However, their excessive computational complexity incurs considerable hardware cost and energy consumption. Accelerators built around many-core neural processing units are emerging to compute DCNNs more energy-efficiently than conventional processors (e.g., CPUs and GPUs), but the huge volume of general-purpose, full-precision computation remains a burden for mobile and edge devices. Much research has therefore sought to simplify DCNN computation, especially the multiply-accumulate (MAC) operations that account for most of the processing time. As a promising alternative to conventional binary computing, stochastic computing (SC) has been studied steadily for low-cost arithmetic, yet previous SC-DCNN approaches suffer critical limitations such as a lack of scalability and loss of accuracy. This dissertation first offers solutions to overcome those problems. Furthermore, SC has additional advantages over binary computing, such as error tolerance, and these strengths are exploited and assessed in the dissertation. Meanwhile, quantization, which replaces high-precision dataflow with low-bit representations and arithmetic, has become popular for reducing DCNN model size and computation cost; low-bit fixed-point representation is currently the most common choice. The dissertation argues that SC and quantization are mutually beneficial: the efficiency of SC-DCNNs can be improved by ordinary quantization, just as in conventional binary computing, while SC's flexibility can exploit quantization more effectively than binary computing can. In addition, novel SC-MAC structures are devised to benefit from more advanced, emerging quantization methods. For each contribution, RTL-implemented SC accelerators are evaluated and compared with conventional binary implementations, and a small FPGA prototype demonstrates the viability of SC-DCNNs. In a deep learning landscape that is rapidly evolving around conventional binary computing, suitably enhanced SC, though less popular than binary, remains a competitive implementation approach with its own benefits.
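    To make the core idea concrete, the following is a minimal sketch of the standard unipolar stochastic-computing multiplication that such SC-MAC work builds on: a value in [0, 1] is encoded as the density of 1s in a bitstream, and a bitwise AND of two independent streams approximates the product. The stream length and random generator are illustrative choices, not the dissertation's specific architecture.

    import random

    def to_bitstream(p, length, rng):
        # Encode a value p in [0, 1] as a bitstream whose density of 1s is p.
        return [1 if rng.random() < p else 0 for _ in range(length)]

    def sc_multiply(a, b, length=1024, seed=0):
        # Unipolar SC multiply: a bitwise AND of two independent bitstreams
        # yields a bitstream whose density approximates a * b.
        rng = random.Random(seed)
        sa = to_bitstream(a, length, rng)
        sb = to_bitstream(b, length, rng)
        product_stream = [x & y for x, y in zip(sa, sb)]
        return sum(product_stream) / length

    if __name__ == "__main__":
        a, b = 0.75, 0.4
        print("SC estimate:", sc_multiply(a, b))   # close to 0.3
        print("Exact      :", a * b)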

    Bitstream-Based Neural Network for Scalable, Efficient, and Accurate Deep Learning Hardware

    While convolutional neural networks (CNNs) continue to renew state-of-the-art performance across many fields of machine learning, their hardware implementations tend to be very costly and inflexible. Neuromorphic hardware, on the other hand, targets higher efficiency, but its inference accuracy lags far behind that of CNNs. To bridge the gap between deep learning and neuromorphic computing, we present a bitstream-based neural network that is both efficient and accurate, as well as flexible in terms of arithmetic precision and hardware size. Our bitstream-based neural network (called SC-CNN) is built on top of CNNs but inspired by stochastic computing (SC), which uses bitstreams to represent numbers. Being based on CNNs, our SC-CNN can be trained with backpropagation, ensuring very high inference accuracy. At the same time, our SC-CNN is deterministic, hence repeatable, and remains highly accurate and scalable even for large networks. Our experimental results demonstrate that our SC-CNN is highly accurate up to ImageNet-targeting CNNs and improves efficiency over conventional digital designs by 50-100% in operations-per-area, depending on the CNN and the application scenario, while losing less than 1% in recognition accuracy. In addition, our SC-CNN implementations can be much more fault-tolerant than conventional digital implementations.
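    The fault-tolerance claim follows from how bitstreams encode values: every bit carries equal weight, so a few flipped bits shift the decoded value only slightly, unlike a flipped high-order bit in a binary word. The sketch below illustrates this with bipolar SC encoding and XNOR-based multiplication; the stream length and fault rate are illustrative assumptions, not parameters from the paper.

    import random

    def encode_bipolar(x, length, rng):
        # Bipolar encoding: x in [-1, 1] maps to a stream with P(bit=1) = (x+1)/2.
        p = (x + 1.0) / 2.0
        return [1 if rng.random() < p else 0 for _ in range(length)]

    def decode_bipolar(stream):
        return 2.0 * sum(stream) / len(stream) - 1.0

    def xnor_multiply(sx, sy):
        # In bipolar SC, a bitwise XNOR of two independent streams encodes x * y.
        return [1 - (a ^ b) for a, b in zip(sx, sy)]

    def flip_bits(stream, fault_rate, rng):
        # Model soft errors by flipping a fraction of bits at random.
        return [b ^ 1 if rng.random() < fault_rate else b for b in stream]

    if __name__ == "__main__":
        rng = random.Random(1)
        x, y, n = 0.6, -0.5, 4096
        sx, sy = encode_bipolar(x, n, rng), encode_bipolar(y, n, rng)
        prod = xnor_multiply(sx, sy)
        print("clean product:", round(decode_bipolar(prod), 3), "(exact", x * y, ")")
        # Flipping 1% of the bits changes the decoded value only slightly.
        print("1% bit flips :", round(decode_bipolar(flip_bits(prod, 0.01, rng)), 3))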

    Cost-effective stochastic MAC circuits for deep neural networks

    Stochastic computing (SC) is a promising computing paradigm that can help address both the uncertainties of future process technology and the challenges of efficient hardware realization for deep neural networks (DNNs). However, the imprecision and long latency of SC have rendered previous SC-based DNN architectures less competitive than optimized fixed-point digital implementations, unless inference accuracy is significantly sacrificed. In this paper we propose a new SC-MAC (multiply-and-accumulate) algorithm, a key building block for SC-based DNNs, that is orders of magnitude more efficient and accurate than previous SC-MACs. We also show how our new SC-MAC can be extended to a vector version and used to accelerate both convolution and fully-connected layers of convolutional neural networks (CNNs) using the same hardware. Our experimental results using CNNs designed for the MNIST and CIFAR-10 datasets demonstrate that not only are our SC-based CNNs more accurate and 40x-490x more energy-efficient for convolution layers than conventional SC-based ones, but they can also achieve lower area-delay product and lower energy than precision-optimized fixed-point implementations without sacrificing accuracy. We also demonstrate the feasibility of our SC-based CNNs through FPGA prototypes.
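    For context, the sketch below shows the general shape of a stochastic MAC for one dot product, in the spirit of counter-based SC accumulation (summing the outputs of many AND gates each cycle). It illustrates the baseline idea that SC-MAC designs improve upon, not the specific algorithm proposed in the paper; the stream length and pre-scaling of operands into [0, 1] are illustrative assumptions.

    import random

    def dot_product_sc(weights, activations, length=2048, seed=0):
        # weights and activations are assumed pre-scaled into [0, 1] (unipolar
        # encoding); each pair is multiplied by ANDing independent bitstreams,
        # and a counter accumulates the per-cycle popcount.
        rng = random.Random(seed)
        streams_w = [[1 if rng.random() < w else 0 for _ in range(length)] for w in weights]
        streams_a = [[1 if rng.random() < a else 0 for _ in range(length)] for a in activations]
        acc = 0
        for t in range(length):
            # Per-cycle popcount over all AND gates, as a parallel counter would do.
            acc += sum(sw[t] & sa[t] for sw, sa in zip(streams_w, streams_a))
        return acc / length  # estimate of sum_i w_i * a_i

    if __name__ == "__main__":
        w = [0.9, 0.2, 0.5, 0.7]
        a = [0.3, 0.8, 0.5, 0.1]
        print("SC MAC estimate:", round(dot_product_sc(w, a), 3))
        print("Exact          :", sum(x * y for x, y in zip(w, a)))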

    A New Stochastic Computing Multiplier with Application to Deep Convolutional Neural Networks

    Stochastic computing (SC) allows for extremely low-cost and low-power implementations of common arithmetic operations. However, the inherent random fluctuation error and long latency of SC degrade accuracy and energy efficiency when SC is applied to convolutional neural networks (CNNs). In this paper we address these two critical problems of SC-based CNNs by proposing a novel SC multiply algorithm and its vector extension, the SC-MVM (matrix-vector multiplier), under which one SC multiply takes just a few cycles, generates much more accurate results, and can be realized with significantly less cost than the conventional SC method. Our experimental results using CNNs designed for the MNIST and CIFAR-10 datasets demonstrate that not only is our SC-based CNN more accurate and 40x-490x more energy-efficient in computation than conventional SC-based ones, but it can also achieve lower area-delay product and lower energy than bitwidth-optimized fixed-point implementations of the same accuracy.
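    The reason a matrix-vector multiplier covers both convolution and fully-connected layers is that convolution can be lowered (e.g., im2col-style) to the same matrix-vector form. The sketch below extends the bitstream dot product to a matrix-vector product to make that shared structure explicit; the function names and the fixed stream length are illustrative assumptions, not the SC-MVM design described in the paper.

    import random

    def sc_matvec(matrix, vector, length=2048, seed=0):
        # matrix: list of rows; all values assumed pre-scaled into [0, 1].
        rng = random.Random(seed)
        # One bitstream per input element, shared by every row that uses it.
        v_streams = [[1 if rng.random() < v else 0 for _ in range(length)] for v in vector]
        result = []
        for row in matrix:
            w_streams = [[1 if rng.random() < w else 0 for _ in range(length)] for w in row]
            acc = sum(sw[t] & sv[t] for t in range(length)
                      for sw, sv in zip(w_streams, v_streams))
            result.append(acc / length)
        return result

    if __name__ == "__main__":
        W = [[0.2, 0.9, 0.4],
             [0.7, 0.1, 0.6]]
        x = [0.5, 0.5, 0.25]
        print("SC estimate:", [round(y, 3) for y in sc_matvec(W, x)])
        print("Exact      :", [sum(w * v for w, v in zip(row, x)) for row in W])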

    Efficient High-Level Synthesis for Nested Loops of Nonrectangular Iteration Spaces

    Most existing solutions for pipelining nested loops are developed for general-purpose processors and may not work efficiently on field-programmable gate arrays due to loop control overhead. This is especially true when the nested loops have nonrectangular iteration spaces (IS). We therefore propose a novel method that transforms triangular IS, the most frequently encountered type of nonrectangular IS, into rectangular ones, so that other loop transformations can be applied effectively and the overall performance of nested loops can be maximized. Our evaluation results using the state-of-the-art Vivado high-level synthesis tool demonstrate that our technique can significantly improve the performance of nested loops with nonrectangular IS.
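    One well-known way to rectangularize a triangular space, shown below purely as an illustration of the idea (the paper's actual transformation and HLS flow may differ), is to fold row i (N - i iterations) together with row N-1-i (i + 1 iterations) so that every merged row has exactly N + 1 iterations, giving a constant inner trip count that an HLS tool can pipeline without per-row control overhead.

    def triangular(N, body):
        # Original nest: the inner trip count depends on the outer index.
        for i in range(N):
            for j in range(i, N):
                body(i, j)

    def folded_rectangular(N, body):
        # Rectangular nest: constant inner trip count of N + 1.
        for p in range(N // 2):
            for q in range(N + 1):
                if q < N - p:
                    body(p, p + q)                      # first half of the fold
                else:
                    i = N - 1 - p
                    body(i, i + (q - (N - p)))          # second half of the fold
        if N % 2 == 1:                                  # leftover middle row
            i = N // 2
            for j in range(i, N):
                body(i, j)

    if __name__ == "__main__":
        N = 7
        a, b = [], []
        triangular(N, lambda i, j: a.append((i, j)))
        folded_rectangular(N, lambda i, j: b.append((i, j)))
        assert sorted(a) == sorted(b)                   # same iterations are covered
        print("iterations:", len(a), "==", N * (N + 1) // 2)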

    DPS: Dynamic Precision Scaling for Stochastic Computing-Based Deep Neural Networks

    Stochastic computing (SC) is a promising technique with advantages such as low cost, low power, and error resilience. However, SC-based CNN (convolutional neural network) accelerators have so far been limited to relatively small CNNs, primarily due to the inherent precision disadvantage of SC. At the same time, previous SC architectures do not exploit dynamic precision, a capability that can be crucial for both efficiency and flexibility in SC-CNN implementations. In this paper we present a DPS (dynamic precision scaling) SC-CNN that can exploit dynamic precision with very low overhead, along with a design methodology for it. Our experimental results demonstrate that our DPS SC-CNN is highly efficient and accurate up to ImageNet-targeting CNNs, and show efficiency improvements over conventional digital designs of 50-100% in operations-per-area, depending on the DNN and the application scenario, while losing less than 1% in recognition accuracy.
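    The sketch below illustrates why precision is naturally dynamic in SC: the effective precision of a bitstream value grows with stream length, so shortening the stream trades accuracy for latency and energy. The stream lengths and trial counts are illustrative, not the DPS configuration from the paper.

    import random

    def sc_multiply(a, b, length, rng):
        sa = [1 if rng.random() < a else 0 for _ in range(length)]
        sb = [1 if rng.random() < b else 0 for _ in range(length)]
        return sum(x & y for x, y in zip(sa, sb)) / length

    if __name__ == "__main__":
        rng = random.Random(42)
        a, b = 0.7, 0.3
        for length in (16, 64, 256, 1024):
            # Average over several trials: the error shrinks as the stream grows,
            # i.e., longer streams buy precision at the cost of more cycles.
            err = sum(abs(sc_multiply(a, b, length, rng) - a * b) for _ in range(200)) / 200
            print(f"length={length:5d}  mean |error|={err:.4f}")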

    Automated Log-Scale Quantization for Low-Cost Deep Neural Networks

    Quantization plays an important role in deep neural network (DNN) hardware. In particular, logarithmic quantization has multiple advantages for DNN hardware implementations, and its weakness, lower performance at high precision compared with linear quantization, has recently been remedied by what we call selective two-word logarithmic quantization (STLQ). However, training methods designed for STLQ, or even for logarithmic quantization in general, are lacking. In this paper we propose a novel STLQ-aware training method that significantly outperforms the previous state-of-the-art training method for STLQ. Moreover, our results demonstrate that with our new training method, STLQ applied to the weight parameters of ResNet-18 can achieve the same level of performance as the state-of-the-art quantization method APoT at 3-bit precision. We also apply our method to various DNNs in image enhancement and semantic segmentation, showing competitive results.
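    For intuition, the sketch below shows plain logarithmic (power-of-two) weight quantization and a two-word refinement in the spirit of STLQ, in which selected weights are represented as the sum of two power-of-two terms. The selection rule (largest one-word residual error), the selection ratio, and the exponent floor are illustrative assumptions, not the exact scheme from the paper.

    import math

    def log_quantize(w, min_exp=-8):
        # Round |w| to the nearest power of two (or to zero if it is too small).
        if w == 0.0:
            return 0.0
        e = round(math.log2(abs(w)))
        if e < min_exp:
            return 0.0
        return math.copysign(2.0 ** e, w)

    def two_word_quantize(w, min_exp=-8):
        # First term: ordinary log quantization; second term: log-quantized residual.
        first = log_quantize(w, min_exp)
        second = log_quantize(w - first, min_exp)
        return first + second

    def stlq_like(weights, ratio=0.25, min_exp=-8):
        # Spend the second word only on the weights with the largest one-word error.
        one_word = [log_quantize(w, min_exp) for w in weights]
        order = sorted(range(len(weights)),
                       key=lambda i: abs(weights[i] - one_word[i]), reverse=True)
        selected = set(order[: int(ratio * len(weights))])
        return [two_word_quantize(w, min_exp) if i in selected else one_word[i]
                for i, w in enumerate(weights)]

    if __name__ == "__main__":
        ws = [0.33, -0.71, 0.05, 0.9, -0.14, 0.27]
        print("one-word :", [round(log_quantize(w), 4) for w in ws])
        print("selective:", [round(q, 4) for q in stlq_like(ws)])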

    Mapping imperfect loops to coarse-grained reconfigurable architectures

    Nested loops account for a significant portion of application runtime in multimedia and DSP applications, an important domain for coarse-grained reconfigurable architectures (CGRAs). Whereas conventional approaches to mapping nested loops pipeline along a single loop dimension only, either the innermost loop or an outer loop, in this paper we explore an orthogonal approach that pipelines along multiple loop dimensions by first flattening the loop nest. To remedy the inevitable problem of repetitive outer-loop computation in flattened loops, we present a small set of special operations that can effectively reduce the number and frequency of micro-operations in the pipelined loop. We also present a loop transformation technique that makes our special operations applicable to a broader range of loops, including those with triangular iteration spaces. Our experimental results using imperfect loops from the StreamIt benchmarks demonstrate that our special operations cover a large portion of the operations in flattened loops, improve the performance of nested loops by nearly 30% over loop flattening alone, and achieve near-ideal execution on CGRAs for imperfect loops.
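    To show the problem being targeted, the sketch below flattens a 2-D imperfect nest in two ways: naively, recomputing indices and outer-loop-only work every iteration, and incrementally, updating indices and redoing the outer-loop work only at row boundaries. It illustrates the general overhead that such special operations address; it is not the CGRA mapping itself, and the example workload is made up.

    def nested(A, B, M, N):
        # Original imperfect nest: some work depends only on the outer index.
        out = 0
        for i in range(M):
            row_base = A[i] * 10        # outer-loop-only computation
            for j in range(N):
                out += row_base + B[i][j]
        return out

    def flattened_divmod(A, B, M, N):
        # Naive flattening: indices recovered with divmod and the outer-loop
        # work redone on every single iteration.
        out = 0
        for k in range(M * N):
            i, j = divmod(k, N)
            row_base = A[i] * 10        # recomputed N times per row
            out += row_base + B[i][j]
        return out

    def flattened_incremental(A, B, M, N):
        # Flattening with incremental index update: the outer-loop work runs
        # only when j wraps around, mimicking a predicated "new row" operation.
        out, i, j = 0, 0, 0
        row_base = A[0] * 10
        for _ in range(M * N):
            out += row_base + B[i][j]
            j += 1
            if j == N:                  # row boundary: update outer state once
                j, i = 0, i + 1
                if i < M:
                    row_base = A[i] * 10
        return out

    if __name__ == "__main__":
        M, N = 4, 5
        A = list(range(M))
        B = [[i + j for j in range(N)] for i in range(M)]
        assert nested(A, B, M, N) == flattened_divmod(A, B, M, N) == flattened_incremental(A, B, M, N)
        print("all three versions agree:", nested(A, B, M, N))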